DOI: 10.1145/3437359.3465582
Research Article
Public Access

INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications

Published: 17 July 2021

Abstract

Understanding the full-stack performance trade-offs and interplay among HPC applications, MPI libraries, the communication fabric, and the job scheduler is a challenging endeavor. Unfortunately, existing profiling tools are disjoint and focus on only one or a few levels of the HPC stack, which limits the insights they can provide. In this paper, we propose a standardized, cross-stack approach in INAM to facilitate near real-time, low-overhead performance characterization, profiling, and evaluation of communication in high-performance communication middleware as well as scientific applications. Profiling is supported in two modes, with and without modifications to the application, depending on the scope of the profiling session. We design and implement an MPI_T-based standardized method to obtain near real-time insights for MPI applications at scales of up to 4,096 processes with less than 5% overhead. Through experimental evaluations of increasing batch sizes for DL training, we demonstrate the novel benefits of INAM for real-time, cross-stack communication analysis, detecting bottlenecks and resolving them to achieve up to 3.6x improvement for the use-case study. The proposed solutions have been publicly released with the latest version of INAM and are currently used in production on various HPC supercomputers.
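As a concrete illustration of the MPI_T-based collection path described in the abstract, the sketch below enumerates the performance variables (pvars) that an MPI library exposes through the standard MPI Tool Information Interface and samples the scalar counters among them. This is a minimal, hedged example of how a tool or instrumented application can read communication counters exposed by the MPI library; it is not INAM's actual implementation, and which pvars exist (if any) is entirely implementation-specific.

```c
/*
 * Illustrative sketch (not INAM itself): enumerate the MPI_T performance
 * variables (pvars) exposed by the MPI library and sample the scalar,
 * globally-bound counters among them. The set of pvars is implementation-
 * specific, so output differs across MPI stacks.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_pvars;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_pvar_get_num(&num_pvars);
    printf("MPI library exposes %d performance variables\n", num_pvars);

    MPI_T_pvar_session session;
    MPI_T_pvar_session_create(&session);

    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic, count;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);

        /* Keep the example simple: unsigned-long counters not bound to any object. */
        if (var_class != MPI_T_PVAR_CLASS_COUNTER ||
            bind != MPI_T_BIND_NO_OBJECT ||
            datatype != MPI_UNSIGNED_LONG)
            continue;

        MPI_T_pvar_handle handle;
        MPI_T_pvar_handle_alloc(session, i, NULL, &handle, &count);
        if (count != 1) {                      /* scalar pvars only */
            MPI_T_pvar_handle_free(session, &handle);
            continue;
        }

        if (!continuous)
            MPI_T_pvar_start(session, handle); /* non-continuous pvars must be started */

        unsigned long value = 0;
        MPI_T_pvar_read(session, handle, &value);
        printf("pvar[%d] %s = %lu  (%s)\n", i, name, value, desc);

        if (!continuous)
            MPI_T_pvar_stop(session, handle);
        MPI_T_pvar_handle_free(session, &handle);
    }

    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```

Built with mpicc against an MPI 3.0 or newer library, this program simply prints whatever scalar counter pvars that library chooses to export; a cross-stack profiler in the spirit of the paper would periodically sample such counters and correlate them with fabric- and job-level metrics.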


Published In

cover image ACM Conferences
PEARC '21: Practice and Experience in Advanced Research Computing 2021: Evolution Across All Dimensions
July 2021
310 pages
ISBN: 9781450382922
DOI: 10.1145/3437359
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '21

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%

Article Metrics

  • Downloads (last 12 months): 240
  • Downloads (last 6 weeks): 49

Reflects downloads up to 10 Nov 2024

Cited By

  • (2024) TinyProf: Towards Continuous Performance Introspection through Scalable Parallel I/O. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1–12. https://doi.org/10.23919/ISC.2024.10528932. Online publication date: May 2024.
  • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing 80(10), 14978–15005. https://doi.org/10.1007/s11227-024-06040-w. Online publication date: 28 March 2024.
  • (2023) DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, 94–101. https://doi.org/10.1145/3569951.3593595. Online publication date: 23 July 2023.
  • (2023) Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 134–144. https://doi.org/10.1109/IPDPS54959.2023.00023. Online publication date: May 2023.
  • (2023) SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC. High Performance Computing, 402–424. https://doi.org/10.1007/978-3-031-32041-5_21. Online publication date: 21 May 2023.
  • (2023) Illuminating the I/O Optimization Path of Scientific Applications. High Performance Computing, 22–41. https://doi.org/10.1007/978-3-031-32041-5_2. Online publication date: 21 May 2023.
  • (2022) An Analysis of Long-Tailed Network Latency Distribution and Background Traffic on Dragonfly+. Benchmarking, Measuring, and Optimizing, 123–142. https://doi.org/10.1007/978-3-031-31180-2_8. Online publication date: 7 November 2022.
  • (2022) "Hey CAI" - Conversational AI Enabled User Interface for HPC Tools. High Performance Computing, 87–108. https://doi.org/10.1007/978-3-031-07312-0_5. Online publication date: 29 May 2022.
  • (2022) Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. High Performance Computing, 3–25. https://doi.org/10.1007/978-3-031-07312-0_1. Online publication date: 29 May 2022.
