A MultiGPU Performance-Portable Solution for Array Programming Based on Kokkos

Authors:

Pedro Valero-Lara,

Jeffrey S. Vetter

ARRAY 2023: Proceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming

Pages 1 - 12

https://doi.org/10.1145/3589246.3595369

Published: 06 June 2023

Abstract

Today, multiGPU nodes are widely used in high-performance computing and data centers. However, current programming models do not provide simple, transparent, and portable support for automatically targeting multiple GPUs within a node in array-programming applications. In this paper, we describe a new application programming interface based on the Kokkos programming model to enable array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils (array boundaries) among different GPU memory spaces, and we provide autotuning that selects the proper number of GPUs, depending on the computational cost of the operations to be computed on arrays, in a way that is completely transparent to the programmer. We evaluate our multiGPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and on Crusher, which contains hardware and software identical to Frontier's (#1 TOP500), with four AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. We also compare the performance of this solution against the use of MPI + Kokkos, which is the current de facto solution for multiple GPUs in Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs and a faster and simpler solution (from a programming productivity perspective) than MPI + Kokkos.
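The exchange of array boundaries among separate memory spaces mentioned in the abstract can be illustrated with a minimal host-only sketch. This is not the paper's API and uses no Kokkos calls: the names `Partition`, `scatter`, `exchange_halos`, `stencil_step`, and `gather` are hypothetical, each `Partition` merely stands in for the memory space of one GPU, and the array is assumed to split evenly. The sketch shows only the general pattern: split a 1D array into chunks padded with halo cells, refresh the halos from neighbouring chunks, then apply a three-point stencil independently per chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One chunk of a 1D array, standing in for the memory space of one GPU.
// Layout: [left halo | interior cells ... | right halo].
struct Partition {
    std::vector<double> cells;
};

// Split `global` into `parts` contiguous chunks, each padded with two halo cells.
// Assumption for brevity: the array is non-empty and divides evenly.
std::vector<Partition> scatter(const std::vector<double>& global, std::size_t parts) {
    assert(parts > 0 && !global.empty() && global.size() % parts == 0);
    const std::size_t chunk = global.size() / parts;
    std::vector<Partition> out(parts);
    for (std::size_t p = 0; p < parts; ++p) {
        out[p].cells.assign(chunk + 2, 0.0);
        for (std::size_t i = 0; i < chunk; ++i)
            out[p].cells[i + 1] = global[p * chunk + i];
    }
    return out;
}

// Copy boundary values into the halos of neighbouring chunks.
// The two outermost halos are never written and stay 0 (fixed boundary).
void exchange_halos(std::vector<Partition>& parts) {
    for (std::size_t p = 0; p + 1 < parts.size(); ++p) {
        std::vector<double>& left = parts[p].cells;
        std::vector<double>& right = parts[p + 1].cells;
        left.back() = right[1];                  // right halo of p <- first interior of p+1
        right.front() = left[left.size() - 2];   // left halo of p+1 <- last interior of p
    }
}

// One three-point averaging step on the interior cells of every chunk.
void stencil_step(std::vector<Partition>& parts) {
    exchange_halos(parts);
    for (Partition& part : parts) {
        const std::vector<double> old = part.cells;  // read from a snapshot
        for (std::size_t i = 1; i + 1 < old.size(); ++i)
            part.cells[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0;
    }
}

// Concatenate the interior cells of all chunks back into one array.
std::vector<double> gather(const std::vector<Partition>& parts) {
    std::vector<double> out;
    for (const Partition& part : parts)
        out.insert(out.end(), part.cells.begin() + 1, part.cells.end() - 1);
    return out;
}
```

Calling `scatter(data, 4)`, then `stencil_step` a few times, then `gather` should give the same values as running with a single partition, because each chunk sees its neighbours' boundary values through the halos. On real hardware the halo copies would be device-to-device transfers, and their cost relative to the stencil work is the kind of trade-off that motivates choosing the number of GPUs automatically.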

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
    2. Parallel programming languages

Published In

ARRAY 2023: Proceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming

June 2023

74 pages

ISBN: 9798400701696

DOI: 10.1145/3589246

General Chairs:
Troels Henriksen (University of Copenhagen, Denmark),
Artjoms Sinkarovs (Heriot-Watt University, UK)

Copyright © 2023 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

US Department of Energy

Conference

ARRAY '23

Sponsor:

SIGPLAN

ARRAY '23: 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming

June 18, 2023

Orlando, FL, USA

Acceptance Rates

Overall Acceptance Rate 17 of 25 submissions, 68%
