For pm-cpu, increase default pelayout to 3 nodes for tests using l%360x720cru #6581

Merged
ndkeen merged 2 commits into master from ndk/machinefiles/pm-cpu-adjust-pelayout-for-hcru on Sep 4, 2024

Conversation

@ndkeen (Contributor) commented Sep 3, 2024

Currently, the tests for this resolution use 192 MPI tasks on pm-cpu, which is an awkward value (1.5 nodes).
Here it's being changed to use -3 (i.e. 3 full nodes, or 384 MPI tasks).

Example of a test that would use this layout: SMS.hcru_hcru.IELM

This change is an effective work-around (but not a fix) for #6521, with #6486 in mind, as noted below.

[bfb]
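
For context on what the -3 value means in practice, here is a minimal sketch of checking or applying the layout from an existing hcru case directory on pm-cpu. The case path is hypothetical and this flow is not part of the PR itself, which only changes the machine-level default:

    # Illustrative only: in CIME, a negative NTASKS is interpreted as a number of
    # whole nodes, so -3 on pm-cpu (128 cores per node) resolves to 384 MPI tasks.
    cd /path/to/hcru-case        # hypothetical case directory
    ./xmlquery NTASKS            # expect -3 (i.e. 384 tasks) for the active components
    ./xmlchange NTASKS=-3        # apply the same layout by hand to an existing case
    ./case.setup --reset         # regenerate layout-dependent files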

@ndkeen added the Machine Files and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels Sep 3, 2024
@ndkeen self-assigned this Sep 3, 2024
@ndkeen (Contributor, Author) commented Sep 3, 2024

This change does not resolve #6521, but it is a clean work-around.
There still seem to be certain MPI counts that trip up one particular test with the nvidia compiler:
ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst

Before going further, I will try the following 3 tests (all I see in cime_config/tests.py using hcru) with the intel, gnu, and nvidia compilers and the updated 256 tasks; a sample invocation is sketched after the list:

            "ERS.hcru_hcru.IELM.elm-multi_inst",
            "SMS_Ld1.hcru_hcru.I1850CRUELMCN",
            "ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.elm-erosion",

Doing these tests reminded me of #6486, which I think is why I had been using 192 MPI tasks. The erosion test above triggers that issue. It seems that using 384 NTASKS (3 full pm-cpu nodes) might be the best option here. Using 128 MPI tasks (1 node) may also work in all cases, but I favor using more for now.

@ndkeen changed the title to For pm-cpu, use 2 full nodes for default tests using l%360x720cru Sep 3, 2024
github-actions bot commented Sep 3, 2024

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6581/
on branch gh-pages at 2024-09-03 20:03 UTC

@ndkeen changed the title from For pm-cpu, use 2 full nodes for default tests using l%360x720cru to For pm-cpu, increase default pelayout to 3 nodes for tests using l%360x720cru Sep 3, 2024
@ndkeen added the nvidia compiler (formerly PGI) label Sep 3, 2024
ndkeen added a commit that referenced this pull request Sep 4, 2024
…next (PR #6581)

@ndkeen (Contributor, Author) commented Sep 4, 2024

merged to next

@ndkeen merged commit 02ace06 into master Sep 4, 2024
21 checks passed
@ndkeen deleted the ndk/machinefiles/pm-cpu-adjust-pelayout-for-hcru branch September 4, 2024 16:58
@ndkeen (Contributor, Author) commented Sep 4, 2024

I merged to master after seeing [ERS.hcru_hcru.IELM.pm-cpu_nvidia.elm-multi_inst](https://my.cdash.org/test/198193537) pass and not seeing any other issues. There were several (5 total in 2 suites) expected NML diffs for tests using the new pelayout, and I've created bless requests.
