DOI: 10.1145/3589246.3595369

A MultiGPU Performance-Portable Solution for Array Programming Based on Kokkos

Published: 06 June 2023
Abstract

    Today, multiGPU nodes are widely used in high-performance computing and data centers. However, current programming models do not provide simple, transparent, and portable support for automatically targeting multiple GPUs within a node in array programming applications. In this paper, we describe a new application programming interface based on the Kokkos programming model that enables array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils (array boundaries) among different GPU memory spaces, and we provide autotuning, completely transparent to the programmer, that selects the proper number of GPUs depending on the computational cost of the array operations. We evaluate our multiGPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and on Crusher, which has hardware and software identical to Frontier (#1 TOP500), with four AMD MI250X GPUs per node, each with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. We also compare the performance of this solution against MPI + Kokkos, the current de facto solution for multiple GPUs in Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs and a faster and simpler solution (from a programming productivity perspective) than MPI + Kokkos.
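    The exchange of stencils (array boundaries) between GPU memory spaces mentioned in the abstract can be illustrated with a minimal host-only sketch. This is not the paper's API and it does not use Kokkos: each `Partition` is a plain `std::vector` standing in for one GPU memory space, and the names `Partition`, `scatter`, `exchange_halos`, `stencil_step`, `gather`, and `run`, as well as the 3-point averaging stencil and the clamped outer boundary, are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One slice of the global array, standing in for one GPU memory space
// (hypothetical type). data.front() and data.back() are halo cells;
// the cells in between are the interior owned by this partition.
struct Partition {
    std::vector<double> data;
};

// Split `global` into `parts` nearly equal slices, each padded with two halo cells.
std::vector<Partition> scatter(const std::vector<double>& global, std::size_t parts) {
    assert(parts > 0 && parts <= global.size());
    std::vector<Partition> out(parts);
    const std::size_t base = global.size() / parts;
    const std::size_t rem = global.size() % parts;
    std::size_t offset = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t n = base + (p < rem ? 1 : 0);
        out[p].data.assign(n + 2, 0.0);
        for (std::size_t i = 0; i < n; ++i) out[p].data[i + 1] = global[offset + i];
        offset += n;
    }
    return out;
}

// Copy each neighbour's edge cell into this partition's halo cell.
// The outermost halos repeat the edge value (clamped boundary, an assumption).
void exchange_halos(std::vector<Partition>& parts) {
    for (std::size_t p = 0; p < parts.size(); ++p) {
        std::vector<double>& d = parts[p].data;
        const std::size_t n = d.size() - 2;  // number of interior cells
        if (p == 0) {
            d[0] = d[1];
        } else {
            const std::vector<double>& left = parts[p - 1].data;
            d[0] = left[left.size() - 2];    // last interior cell of the left neighbour
        }
        d[n + 1] = (p + 1 == parts.size()) ? d[n] : parts[p + 1].data[1];
    }
}

// One 3-point averaging step: refresh halos, then update every interior cell.
void stencil_step(std::vector<Partition>& parts) {
    exchange_halos(parts);
    for (Partition& part : parts) {
        const std::vector<double> old = part.data;
        for (std::size_t i = 1; i + 1 < old.size(); ++i)
            part.data[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0;
    }
}

// Concatenate the interior cells of all partitions back into one array.
std::vector<double> gather(const std::vector<Partition>& parts) {
    std::vector<double> out;
    for (const Partition& part : parts)
        out.insert(out.end(), part.data.begin() + 1, part.data.end() - 1);
    return out;
}

// Run `steps` stencil iterations on `devices` partitions and return the result.
std::vector<double> run(const std::vector<double>& input, std::size_t devices, int steps) {
    std::vector<Partition> parts = scatter(input, devices);
    for (int s = 0; s < steps; ++s) stencil_step(parts);
    return gather(parts);
}
```

    Calling `run(input, 1, steps)` and `run(input, 3, steps)` on the same input should give the same array, because the halo exchange supplies each partition with exactly the neighbouring values that a single-buffer run would read. On real hardware the copies inside `exchange_halos` would be transfers between device memory spaces, which is where the cost that motivates choosing the number of GPUs comes from.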

        Published In

        ARRAY 2023: Proceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming
        June 2023
        74 pages
        ISBN:9798400701696
        DOI:10.1145/3589246
        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Array programming
        2. C++ metaprogramming
        3. CUDA
        4. HIP
        5. Kokkos
        6. MPI
        7. autotuning
        8. multiGPU
        9. parallel programming model
        10. performance portability
        11. productivity

        Qualifiers

        • Research-article

        Funding Sources

        • US Department of Energy

        Conference

        ARRAY '23

        Acceptance Rates

        Overall Acceptance Rate 17 of 25 submissions, 68%
