Hang for test using nvidia compiler only for certain smaller MPI counts ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst #6521

Open · ndkeen opened this issue Jul 22, 2024 · 4 comments
Labels: nvidia compiler (formerly PGI), pm-cpu (Perlmutter at NERSC, CPU-only nodes)

ndkeen commented Jul 22, 2024

This looks like a new test -- it is failing on pm-cpu with the nvidia compiler. Based on the dates of the log files, it looks like the test is hanging.

Note the current MPI count used by default for this test is 192, which is 1.5 nodes on pm-cpu (128 cores per node).
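
For anyone trying to reproduce this, a typical CIME invocation would look roughly like the sketch below (this assumes an E3SM checkout with the usual cime/scripts layout; project, queue, and walltime options are omitted):

  cd cime/scripts
  ./create_test ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst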

ndkeen commented Sep 1, 2024

This test is still failing/hanging. Adding a few more details: the tests below that lack the multi_inst modifier are OK, so it looks like it's a combination of the multi_inst modifier with newer nvidia compilers.

These tests pass with nvidia 23.9 as well as 24.5:

SMS.hcru_hcru.IELM.pm-cpu_nvidia
SMS_D.hcru_hcru.IELM.pm-cpu_nvidia
ERS.hcru_hcru.IELM.pm-cpu_nvidia

And these tests seem to have the same fail/hang issue:

SMS_D.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
ERS_D.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst

Where the flow might be during the hang:

#0  cxip_ep_ctrl_progress_locked (ep_obj=0x11bfff40) at prov/cxi/src/cxip_ctrl.c:373
#1  0x000014d4b2e591dd in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:186
#2  0x000014d4b2e5e969 in cxip_util_cq_progress (util_cq=0x11bef9d0) at prov/cxi/src/cxip_cq.c:112
#3  0x000014d4b2e3a301 in ofi_cq_readfrom (cq_fid=0x11bef9d0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#4  0x000014d4b75491b4 in MPIR_Waitall_impl () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#5  0x000014d4b7595025 in MPIR_Waitall () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#6  0x000014d4b7595821 in PMPI_Waitall () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#7  0x000014d4b9030ebe in pmpi_waitall__ () from /opt/cray/pe/lib64/libmpifort_nvidia.so.12
#8  0x00000000026b0766 in m_rearranger::rearrange_ ()
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/externals/mct/mct/m_Rearranger.F90:1194
#9  0x000000000090a555 in seq_map_mod::seq_map_map (mapper=..., av_s=..., av_d=..., fldlist=..., 
    norm=<error reading variable: Cannot access memory at address 0x0>, avwts_s=<error reading variable: Location address is not set.>, 
    avwtsfld_s=..., string=..., msgtag=1014, omit_nonlinear=<error reading variable: Cannot access memory at address 0x0>)
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/seq_map_mod.F90:345
#10 0x0000000000759f9d in component_mod::component_exch (comp=..., flow=..., infodata=..., infodata_string=..., mpicom_barrier=-1006632930, 
    run_barriers=.FALSE., timer_barrier=..., timer_comp_exch=..., timer_map_exch=..., timer_infodata_exch=...)
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/component_mod.F90:908
#11 0x0000000000742302 in cime_comp_mod::cime_run_lnd_recv_post ()
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_comp_mod.F90:4301
#12 0x0000000000733818 in cime_comp_mod::cime_run ()
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_comp_mod.F90:3043
#13 0x000000000074fedc in cime_driver ()
    at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_driver.F90:153
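
For reference, a generic way to capture a backtrace like the one above is to attach gdb to one of the stuck e3sm.exe ranks on its compute node; the sketch below uses placeholder node/PID values and is not necessarily how this particular trace was produced:

  squeue --me                    # find the job and the compute nodes it is running on
  ssh <compute-node-name>        # a node hostname taken from the squeue output
  pgrep e3sm.exe                 # pick the PID of one of the hung ranks
  gdb -p <pid> -batch -ex bt     # attach, print the stack, then exit/detach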

ndkeen commented Sep 1, 2024

OK, there might be an issue with how tasks/jobs are launched across nodes, as the test passes if I force it to land on one node only. That is: ERS_P128x1.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst
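
For context, the _P128x1 test modifier forces 128 MPI tasks x 1 thread, i.e. exactly one pm-cpu node. A sketch of running that single-node variant (same assumptions as the earlier create_test example):

  ./create_test ERS_P128x1.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst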

ndkeen commented Sep 3, 2024

I created PR #6581 to use 3 full nodes (384 MPI's) instead of the current odd value of 192 MPI's (1.5 nodes). There must have been a reason why I used 192 here -- and indeed a search reminds me of #6486.

I want to keep this issue open as it's still odd that certain MPI counts cause a hang while others don't, and seemingly only for the nvidia compiler.
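
One way to narrow this down would be to sweep the _P task-count modifier across node boundaries; only the results actually reported in this issue are noted in the comments below, the rest are untested here:

  ./create_test ERS_P128x1.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst   # 1 node    - passes (see comment above)
  ./create_test ERS_P192x1.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst   # 1.5 nodes - same MPI count as the hanging default
  ./create_test ERS_P256x1.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst   # 2 nodes   - not reported in this issue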

ndkeen added a commit that referenced this issue Sep 4, 2024
…next (PR #6581)

Currently, the tests for this resolution use 192 MPI's on pm-cpu, which is an odd value (1.5 nodes). Here it's being changed to use -3 (i.e. 3 full nodes, or 384 MPI's).

Example of a test that would use this layout: SMS.hcru_hcru.IELM

This change is an effective work-around (but not a fix) for #6521, with #6486 in mind as noted below.

[bfb]
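
For a single existing case (outside the config_pes change in the PR), roughly the same layout could be forced with xmlchange; in CIME a negative NTASKS value is interpreted as a number of nodes. A sketch, run from the case directory (NTASKS without a component suffix applies to all components):

  ./xmlchange NTASKS=-3
  ./case.setup --reset
  ./case.build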

ndkeen commented Sep 4, 2024

Merged #6581, so we should not see the issue on cdash.

ndkeen changed the title from "Hang with a new test using nvidia compiler ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst" to "Hang for test using nvidia compiler only for certain smaller MPI counts ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst" on Sep 7, 2024