Description:
Two drains were incorrectly detected on different failure domains, which caused Rook to create two drain events protecting different nodes with PDBs. The result is a softlock: every node now has a PDB blocking it from being drained.
Events that caused the issue:
node-3 was drained without issue
osd-0 on node-3 did not reach a healthy state in the k8s API, but it rejoined the Ceph cluster and all PGs became active+clean
Rook decides to stop drain protection because the cluster is active+clean
PDBs are restored to the default
osd-0 is still detected as down by Rook
osd-0 becomes active
osd-2 on node-1 is drained (the default PDB allows one pod to be down, and osd-0 is not up)
Rook checks the PGs for osd-0 and finds non-active PGs (osd-2 caused them, but Rook does not know that, since the same PGs are also on other OSDs)
Rook creates PDBs to block all nodes other than node-3
Rook detects that osd-2 is down
Rook creates PDBs to protect osd-2
softlock: all nodes are blocked, and osd-2 is down and will not come up until the other OSDs on that node are drained, which will never happen because the PDBs block it (a simplified sketch of this sequence follows below)
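To make the two overlapping reconcile passes easier to follow, here is a simplified Go sketch of the decision as I understand it from the logs below. The names (clusterSnapshot, reconcilePDBs, etc.) are made up for illustration and are not the actual Rook code.

```go
package main

import "fmt"

// clusterSnapshot holds the two views the controller works from. They are read
// at different times, which is exactly the non-atomicity problem described above.
type clusterSnapshot struct {
	pgsActiveClean     bool     // from the Ceph status at one point in time
	downFailureDomains []string // from the k8s API at a (possibly different) point in time
	allFailureDomains  []string
}

// reconcilePDBs mirrors the behaviour visible in the logs: when PGs are clean the
// default PDB (maxUnavailable=1) is restored; when they are not, every failure
// domain other than the one with down OSDs gets a blocking PDB (maxUnavailable=0).
func reconcilePDBs(s clusterSnapshot) {
	if s.pgsActiveClean {
		fmt.Println("restore default pdb rook-ceph-osd (maxUnavailable=1)")
		return
	}
	for _, fd := range s.allFailureDomains {
		if !contains(s.downFailureDomains, fd) {
			fmt.Printf("create blocking pdb for %s (maxUnavailable=0)\n", fd)
		}
	}
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	// Pass 1: osd-0 (node-3) still looks down and PGs are not clean -> block node-1 and node-2.
	reconcilePDBs(clusterSnapshot{
		pgsActiveClean:     false,
		downFailureDomains: []string{"node-3"},
		allFailureDomains:  []string{"node-1", "node-2", "node-3"},
	})
	// Pass 2: osd-2 (node-1) is now seen as down and PGs are still not clean -> block node-2 and node-3.
	// Combined with pass 1, every node now carries a blocking PDB: the softlock.
	reconcilePDBs(clusterSnapshot{
		pgsActiveClean:     false,
		downFailureDomains: []string{"node-1"},
		allFailureDomains:  []string{"node-1", "node-2", "node-3"},
	})
}
```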
Issues:
Getting the k8s and Ceph status is not atomic, so the two views do not fully reflect reality.
Two drains can block each other.
Possible solutions:
When setting PDBs because of a drain, it might be good to check whether the drained node already has a PDB blocking drains from it. One solution could be to add the PDBs for the other nodes and then clean up the PDB matching this node. That would resolve this case, but I am not sure whether there are other cases where it could cause issues.
This might require a check against the k8s API to confirm that this failure domain is the only one with failed OSDs before clearing its PDB, so that we do not accidentally allow multiple failures at the same time (see the sketch after this list).
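A minimal Go sketch of that proposed check, assuming hypothetical helpers (hasBlockingPDB, failureDomainsWithDownOSDs) that do not exist in Rook today; it only illustrates the ordering, not a real implementation:

```go
package main

import "fmt"

// handleDrain is what the controller might do when a drain is detected in
// drainedFD: block the other failure domains first, and only then consider
// clearing a stale blocking PDB that already targets drainedFD itself.
func handleDrain(drainedFD string, allFDs []string,
	hasBlockingPDB func(string) bool,
	failureDomainsWithDownOSDs func() []string) {

	for _, fd := range allFDs {
		if fd != drainedFD {
			fmt.Printf("ensure blocking pdb for %s (maxUnavailable=0)\n", fd)
		}
	}

	if hasBlockingPDB(drainedFD) {
		// Only clear the PDB for drainedFD if it is the sole failure domain with
		// down OSDs, so we never open two failure domains to disruption at once.
		down := failureDomainsWithDownOSDs()
		if len(down) == 1 && down[0] == drainedFD {
			fmt.Printf("delete stale blocking pdb for %s\n", drainedFD)
		}
	}
}

func main() {
	all := []string{"node-1", "node-2", "node-3"}
	// Example: node-1 is being drained but still carries a blocking PDB left over
	// from the earlier node-3 event.
	handleDrain("node-1", all,
		func(fd string) bool { return fd == "node-1" },
		func() []string { return []string{"node-1"} })
}
```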
Logs from the event:
2024-09-12 23:03:14.956779 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-09-12 23:03:15.285896 I | clusterdisruption-controller: osd is down in failure domain "node-3" is down for the last 21.94 minutes, but pgs are active+clean
2024-09-12 23:03:15.556659 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings
2024-09-12 23:03:15.556674 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-09-12 23:03:15.559175 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-node-1" with maxUnavailable=0 for "host" failure domain "node-1"
2024-09-12 23:03:15.561094 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-node-2" with maxUnavailable=0 for "host" failure domain "node-2"
2024-09-12 23:03:15.563269 E | clusterdisruption-controller: failed to update configMap "rook-ceph-pdbstatemap" in cluster "rook-ceph/rook-ceph-cluster": Operation cannot be fulfilled on configmaps "rook-ceph-pdbstatemap": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:18.125054 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-09-12 23:03:18.461688 I | clusterdisruption-controller: osd is down in failure domain "node-3" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:76} {StateName:peering Count:31} {StateName:active+undersized+degraded Count:22}]"
2024-09-12 23:03:20.899793 I | op-osd: updating OSD 2 on node "node-1"
2024-09-12 23:03:20.909820 I | ceph-cluster-controller: hot-plug cm watcher: running orchestration for namespace "rook-ceph" after device change
2024-09-12 23:03:20.915974 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-1" with maxUnavailable=0 for "host" failure domain "node-1"
2024-09-12 23:03:20.920432 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-2" with maxUnavailable=0 for "host" failure domain "node-2"
2024-09-12 23:03:20.920715 I | op-osd: updating OSD 5 on node "node-1"
2024-09-12 23:03:20.926573 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-09-12 23:03:20.952527 E | clusterdisruption-controller: failed to update configMap "rook-ceph-pdbstatemap" in cluster "rook-ceph/rook-ceph-cluster": Operation cannot be fulfilled on configmaps "rook-ceph-pdbstatemap": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:20.956665 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down and a possible node drain is detected
2024-09-12 23:03:20.982523 I | op-osd: waiting... 2 of 2 OSD prepare jobs have finished processing and 4 of 6 OSDs have been updated
2024-09-12 23:03:20.982537 I | op-osd: restarting watcher for OSD provisioning status ConfigMaps. the watcher closed the channel
2024-09-12 23:03:21.254436 E | ceph-spec: failed to update cluster condition to {Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully LastHeartbeatTime:2024-09-12 23:03:21.248367805 +0000 UTC m=+1343.926592861 LastTransitionTime:2024-09-04 17:17:51 +0000 UTC}. failed to update object "rook-ceph/rook-ceph-cluster" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "rook-ceph-cluster": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 23:03:21.293972 I | clusterdisruption-controller: osd is down in failure domain "node-1" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:76} {StateName:active+undersized+degraded Count:53}]"
2024-09-12 23:03:22.673733 I | op-osd: OSD 1 is not ok-to-stop. will try updating it again later
2024-09-12 23:03:23.314053 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-node-3" with maxUnavailable=0 for "host" failure domain "node-3"
Rook version: v1.12.11