Drain pdbs can softlock draining #14730

Open
Sunnatillo opened this issue Sep 17, 2024 · 0 comments

Rook version: v1.12.11

Description:
Two drains were incorrectly detected on different failure domains. This caused Rook to create two drain events that protected different nodes with PDBs, resulting in a softlock: every node then had a PDB blocking it from being drained.

Events that caused the issue:

  1. node-3 was drained successfully
  2. osd-0 on node-3 did not reach a healthy state in the k8s API, but it rejoined the Ceph cluster and all PGs became active+clean
  3. Rook decides to stop drain protection because the cluster is active+clean
  4. PDBs are restored to the default
  5. osd-0 is still detected as down by Rook
  6. osd-0 becomes active
  7. osd-2 (on node-1) is drained, since the default PDB allows one pod to be down and osd-0 is not up
  8. Rook checks the PGs for osd-0 and finds non-active PGs (they were caused by osd-2 going down, but Rook cannot tell, because the same PGs also live on other OSDs)
  9. Rook creates PDBs to block all nodes other than node-3
  10. Rook detects osd-2 as down
  11. Rook creates PDBs to protect osd-2
  12. Softlock: all nodes are blocked and osd-2 is down; it will not come back up until the other OSDs on that node are drained, which will never happen because the PDBs block it (see the PDB sketch after this list)
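
For reference, here is roughly what the two PDB shapes from the log look like as objects. The names and maxUnavailable values are taken from the log below; the selector labels are my assumption, not necessarily what Rook actually sets. Once a maxUnavailable=0 PDB exists for every failure domain, no OSD pod can be evicted anywhere, which is the softlock in step 12.

```go
package disruption

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// defaultOSDPDB mirrors the "rook-ceph-osd" PDB from the log: at most one OSD
// pod may be unavailable across the whole cluster.
func defaultOSDPDB(namespace string) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "rook-ceph-osd", Namespace: namespace},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			// Assumed selector: all OSD pods in the cluster.
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "rook-ceph-osd"}},
		},
	}
}

// blockingPDB mirrors the temporary "rook-ceph-osd-host-<node>" PDBs:
// maxUnavailable=0, so no OSD pod in that failure domain may be evicted.
func blockingPDB(namespace, failureDomain string) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("rook-ceph-osd-host-%s", failureDomain),
			Namespace: namespace,
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			// Assumed label key; the intent is "all OSDs on this host".
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"topology-location-host": failureDomain},
			},
		},
	}
}
```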

Issues:

  1. Getting the k8s and Ceph status is not atomic, so the two views do not fully reflect reality at the same point in time.
  2. Two drains can block each other.

Possible solutions:
When creating PDBs because of a drain, it might be good to check whether this node already has a PDB blocking drains from it. A solution could be to add the PDBs for the other nodes first and then clean up the PDB matching this node. That would clear this case, but I am not sure whether there are other cases where it could cause issues.
This might also need a check against the k8s API to confirm that this failure domain is the only one with failed OSDs before clearing its PDB, so that we do not accidentally allow multiple failures at the same time.
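
A minimal sketch of the proposed ordering, assuming hypothetical helper names (createBlockingPDB, deleteBlockingPDB, failureDomainsWithDownOSDs); this is not Rook's actual controller code:

```go
package disruption

import "context"

// pdbOps abstracts the few operations the sketch needs; the real controller
// would use its Kubernetes client and the pdbstatemap instead.
type pdbOps interface {
	createBlockingPDB(ctx context.Context, failureDomain string) error
	deleteBlockingPDB(ctx context.Context, failureDomain string) error
	failureDomainsWithDownOSDs(ctx context.Context) ([]string, error)
}

// protectOthersThenClearOwn first blocks eviction in every other failure
// domain, and only then clears the blocking PDB for the domain that is
// currently draining, so the drain cannot deadlock on its own PDB.
func protectOthersThenClearOwn(ctx context.Context, c pdbOps, drainingFD string, allFDs []string) error {
	for _, fd := range allFDs {
		if fd == drainingFD {
			continue
		}
		if err := c.createBlockingPDB(ctx, fd); err != nil {
			return err
		}
	}

	// Only clear the draining domain's own blocking PDB if it is the only
	// failure domain with down OSDs, so we never open up two failure
	// domains to disruption at the same time.
	downFDs, err := c.failureDomainsWithDownOSDs(ctx)
	if err != nil {
		return err
	}
	if len(downFDs) == 1 && downFDs[0] == drainingFD {
		return c.deleteBlockingPDB(ctx, drainingFD)
	}
	return nil
}
```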

Logs from the event:
2024-09-12 23:03:14.956779 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-09-12 23:03:15.285896 I | clusterdisruption-controller: osd is down in failure domain "node-3" is down for the last 21.94 minutes, but pgs are active+clean
2024-09-12 23:03:15.556659 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings
2024-09-12 23:03:15.556674 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-09-12 23:03:15.559175 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-node-1" with maxUnavailable=0 for "host" failure domain "node-1"
2024-09-12 23:03:15.561094 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-node-2" with maxUnavailable=0 for "host" failure domain "node-2"
2024-09-12 23:03:15.563269 E | clusterdisruption-controller: failed to update configMap "rook-ceph-pdbstatemap" in cluster "rook-ceph/rook-ceph-cluster": Operation cannot be fulfilled on configmaps "rook-ceph-pdbstatemap": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:18.125054 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-09-12 23:03:18.461688 I | clusterdisruption-controller: osd is down in failure domain "node-3" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:76} {StateName:peering Count:31} {StateName:active+undersized+degraded Count:22}]"
2024-09-12 23:03:20.899793 I | op-osd: updating OSD 2 on node "node-1"
2024-09-12 23:03:20.909820 I | ceph-cluster-controller: hot-plug cm watcher: running orchestration for namespace "rook-ceph" after device change
2024-09-12 23:03:20.915974 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-1" with maxUnavailable=0 for "host" failure domain "node-1"
2024-09-12 23:03:20.920432 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-2" with maxUnavailable=0 for "host" failure domain "node-2"
2024-09-12 23:03:20.920715 I | op-osd: updating OSD 5 on node "node-1"
2024-09-12 23:03:20.926573 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-09-12 23:03:20.952527 E | clusterdisruption-controller: failed to update configMap "rook-ceph-pdbstatemap" in cluster "rook-ceph/rook-ceph-cluster": Operation cannot be fulfilled on configmaps "rook-ceph-pdbstatemap": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:20.956665 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down and a possible node drain is detected
2024-09-12 23:03:20.982523 I | op-osd: waiting... 2 of 2 OSD prepare jobs have finished processing and 4 of 6 OSDs have been updated
2024-09-12 23:03:20.982537 I | op-osd: restarting watcher for OSD provisioning status ConfigMaps. the watcher closed the channel
2024-09-12 23:03:21.254436 E | ceph-spec: failed to update cluster condition to {Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully LastHeartbeatTime:2024-09-12 23:03:21.248367805 +0000 UTC m=+1343.926592861 LastTransitionTime:2024-09-04 17:17:51 +0000 UTC}. failed to update object "rook-ceph/rook-ceph-cluster" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "rook-ceph-cluster": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:21.293972 I | clusterdisruption-controller: osd is down in failure domain "node-1" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:76} {StateName:active+undersized+degraded Count:53}]"
2024-09-12 23:03:22.673733 I | op-osd: OSD 1 is not ok-to-stop. will try updating it again later
2024-09-12 23:03:23.314053 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-3" with maxUnavailable=0 for "host" failure domain "node-3"
