kubelet: memorymanager static policy startup error #113130
Comments
/sig node
/triage needs-information Thank you for the beautifully crafted error report! The k8s version you are running is old and out of support (kubelet v1.21.0). Would it be possible to check the behavior on the latest version of k8s? I know there is an existing limitation around the cpu manager (not sure about the memory manager) that requires handling the checkpoint file before the restart; I'm not sure whether it applies here. Please check on the latest k8s and report back.
@SergeyKanzhelev
/cc
Hmm, in general the memory manager should not allow allocating the same NUMA nodes for both cross- and single-NUMA-node allocations; it is one of the limitations that we have, see https://kubernetes.io/blog/2021/08/11/kubernetes-1-22-feature-memory-manager-moves-to-beta/#single-vs-cross-numa-node-allocation. @gaohuatao-1 Which topology manager do you use?
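The single- vs cross-NUMA-node limitation linked above can be sketched with a toy model. This is illustrative only, not kubelet code; the group bookkeeping below is an assumption made for the example:

```python
# Toy model of the single- vs cross-NUMA-node limitation (not kubelet code).
# Once a guaranteed pod spans several NUMA nodes, those nodes form a group,
# and no node of that group should then serve a single-NUMA-node allocation.

# After a pod like Pod1 (240G on two 220G nodes), node0 and node1 share a group:
group_of = {0: (0, 1), 1: (0, 1)}

def allows_single_numa_allocation(node: int) -> bool:
    """A single-NUMA request may only land on a node grouped by itself."""
    return group_of.get(node, (node,)) == (node,)

print(allows_single_numa_allocation(0))  # False: node0 is in a cross-NUMA group
print(allows_single_numa_allocation(2))  # True: node2 was never grouped
```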
Thanks for your comment. |
I see; it probably will not be a problem once both pods are pinned to multiple NUMA nodes. And it looks like you are correct in comment #113130 (comment): we should validate the state in descending order.
@gaohuatao-1 I had a similar problem and submitted a PR to fix it. Can you help review the code?
For a node group, we cannot restore the free and reserved sizes of each node in the group after multiple allocations and releases of one resource.
In the third step, is the ideal case that numa1 equals 220 and numa2 equals 20? @gaohuatao-1
Yes, you are right. |
Thanks for your work, I will review it later.
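The ideal case agreed on above (reserved memory of 220 on one node and 20 on the other after deleting Pod2) can be sketched like this. Draining nodes in descending index order is just one illustrative way to reach that split in this particular scenario, not necessarily what the actual fix does:

```python
# Illustrative only: with 220G per NUMA node, Pod1 (240G) reserves [220, 20]
# across the group, and Pod2 (20G) raises node1 to [220, 40]. Releasing Pod2
# should restore [220, 20]; one way to get there in this scenario is to drain
# nodes in descending order, so the release comes off node1.

def release_descending(reserved, size):
    for i in reversed(range(len(reserved))):
        give = min(size, reserved[i])
        reserved[i] -= give
        size -= give
        if size == 0:
            return

reserved = [220, 40]          # after Pod1 and Pod2
release_descending(reserved, 20)  # delete Pod2
print(reserved)               # [220, 20] -- the ideal case from the comments
```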
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten. Since we have an active PR, I will move this to triaged. /triage accepted
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle stale
What happened?
In our scenario, the memory manager is enabled. There are two NUMA nodes on the host: node0 and node1.
The relevant parameter values are as follows:
memoryManagerPolicy: Static
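For reference, enabling this policy usually looks like the following KubeletConfiguration fragment. The values here are illustrative, not taken from this report; note that the static policy also requires reservedMemory to be set:

```
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static
# The static policy requires some memory to be pre-reserved per NUMA node;
# the 1Gi sizes below are placeholders, not values from this report.
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi
  - numaNode: 1
    limits:
      memory: 1Gi
```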
Initially, no pods are running on this node, which has two NUMA nodes with 220G of memory each. Follow the steps below to create and delete pods:
Step 1: create guaranteed Pod1 with one container, memory req and limit: 240G
Step 2: create guaranteed Pod2 with one container, memory req and limit: 20G
At this point, machineState is as follows:
Step 3: delete Pod2
At this point, machineState will be as follows:
Step 4: create guaranteed Pod3 with one container, memory req and limit: 10G
At this point, the actual machineState is as follows:
Now, restarting the kubelet will fail: when the kubelet restarts, the expected machineState it recomputes does not equal the actual machineState above.
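The steps above can be reproduced with a small arithmetic model. This is an illustrative sketch of the accounting, not the kubelet's actual code; in particular, the assumption that allocations fill and releases drain the group's nodes in ascending order is mine, chosen because it reproduces the mismatch described in this report:

```python
# Illustrative model of the memory manager static-policy accounting bug.
# Sizes are in GiB; two NUMA nodes with 220G each form one group once a
# cross-NUMA pod (Pod1) is admitted.

CAPACITY = [220, 220]  # per-NUMA-node memory

def allocate(reserved, size):
    """Reserve `size` across the group, filling nodes in ascending order."""
    for i, cap in enumerate(CAPACITY):
        take = min(size, cap - reserved[i])
        reserved[i] += take
        size -= take
        if size == 0:
            return
    raise RuntimeError("not enough memory in the group")

def release(reserved, size):
    """Release `size` from the group, draining nodes in ascending order."""
    for i in range(len(reserved)):
        give = min(size, reserved[i])
        reserved[i] -= give
        size -= give
        if size == 0:
            return

reserved = [0, 0]
allocate(reserved, 240)  # Pod1: spills over node0 into node1 -> [220, 20]
allocate(reserved, 20)   # Pod2: node0 is full, lands on node1 -> [220, 40]
release(reserved, 20)    # delete Pod2: drained from node0 first -> [200, 40]
allocate(reserved, 10)   # Pod3: fits into node0's freed space -> [210, 40]

# On restart, the kubelet replays only the surviving pods (Pod1, Pod3)
# and computes a different per-node split from the checkpointed one:
expected = [0, 0]
allocate(expected, 240)  # Pod1 -> [220, 20]
allocate(expected, 10)   # Pod3 -> [220, 30]

print(reserved, expected)  # [210, 40] vs [220, 30] -> validation fails
```

The totals agree (250G reserved either way), but the per-node split differs, which is exactly why the startup state validation rejects the checkpoint.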
What did you expect to happen?
Pod creation and deletion order should not cause kubelet restart to fail.
How can we reproduce it (as minimally and precisely as possible)?
See analysis above.
Anything else we need to know?
No response
Kubernetes version
v1.21 and later versions have this problem.
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)