
Cannot successfully create CephObjectStore after the store has been deleted. #14763

Closed
denppa opened this issue Sep 24, 2024 · 3 comments
denppa commented Sep 24, 2024

  • Bug Report

Deviation from expected behavior:

The rgw pod does not get created after the CephObjectStore CR has been deleted.

Expected behavior:

The CephObjectStore and its rgw pod should spin up even if an object store with the same name existed before.

How to reproduce it (minimal and precise):

Create the following CephObjectStore resource:

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: nextcloud-obj-store
  namespace: rook-ceph
spec:
  metadataPool:
    #failureDomain: host
    replicated:
      size: 3
  dataPool:
    #failureDomain: host
    replicated:
      size: 3
    quotas:
      maxSize: 5Ti
  preservePoolsOnDelete: true
  gateway:
    #sslCertificateRef:
    port: 80
    # securePort: 443
    instances: 1
kubectl create -f crd.yml

Then delete it:

kubectl delete -f crd.yml
# In a second shell
kubectl -n rook-ceph edit cephobjectstores.ceph.rook.io nextcloud-obj-store
# comment out the finalizer
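
Equivalently, the finalizer can be cleared non-interactively. A minimal sketch using the standard kubectl patch for clearing finalizers; note this is exactly the step that causes the problem described below, since Rook is then unable to clean up the pools:

# Clear all finalizers so the pending delete completes immediately
# (bypasses Rook's graceful pool cleanup -- use with caution)
kubectl -n rook-ceph patch cephobjectstore nextcloud-obj-store \
  --type merge -p '{"metadata":{"finalizers":null}}'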

Then create the object store again, and it will fail: the PHASE column of the watch command below flips between Progressing and Failure in a loop, and the rgw pod is never created.

Monitor commands:

kubectl -n rook-ceph get pods --watch
kubectl -n rook-ceph get cephobjectstores.ceph.rook.io --watch

Logs to submit:

ceph-object-controller-detect-version-w995k         0/1     Pending           0             0s
ceph-object-controller-detect-version-w995k         0/1     Pending           0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             1s
ceph-object-controller-detect-version-w995k         0/1     PodInitializing   0             3s
ceph-object-controller-detect-version-w995k         0/1     PodInitializing   0             3s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             3s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             4s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             4s
ceph-object-controller-detect-version-w995k         0/1     Completed         0             4s
ceph-object-controller-detect-version-w995k         0/1     Completed         0             5s
ceph-object-controller-detect-version-w995k         0/1     Completed         0
NAME                  PHASE         ENDPOINT                                                    SECUREENDPOINT   AGE
nextcloud-obj-store   Progressing   http://rook-ceph-rgw-nextcloud-obj-store.rook-ceph.svc:80                    43s
nextcloud-obj-store   Failure                                                                                    53s
nextcloud-obj-store   Progressing   http://rook-ceph-rgw-nextcloud-obj-store.rook-ceph.svc:80                    53s
nextcloud-obj-store   Failure
  • Operator's logs, if necessary
Sorry, the operator restarted and the logs were cleared.

Cluster Status to submit:
All healthy.

Environment:

  • OS (e.g. from /etc/os-release): debian bookworm
  • Kernel (e.g. uname -a): Debian 6.1.106-3
  • Cloud provider or hardware configuration: baremetal
  • Rook version (use rook version inside of a Rook Pod): v1.15.2
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 18.2.4 ... reef (stable)
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): baremetal k8s
  • Storage backend status: all healthy (see "Cluster Status" above)

sh-5.1$ rook version
2024/09/24 18:47:01 maxprocs: Leaving GOMAXPROCS=96: CPU quota undefined
rook: v1.15.2
go: go1.22.7

Temporary Solution

Recreation succeeds if I delete every pool used by the rgw service, including the .rgw.root pool.
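
For reference, a sketch of that cleanup from the rook-ceph-tools pod. The pool names are assumptions based on Rook's usual <store-name>.rgw.* naming; verify them against ceph osd pool ls, and note that pool deletion must be enabled first:

# List the leftover rgw pools to confirm their names
ceph osd pool ls | grep rgw

# Allow pool deletion, then remove each rgw pool
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete .rgw.root .rgw.root --yes-i-really-really-mean-it
ceph osd pool delete nextcloud-obj-store.rgw.meta nextcloud-obj-store.rgw.meta --yes-i-really-really-mean-it
# ...repeat for the remaining nextcloud-obj-store.rgw.* pools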

@denppa denppa added the bug label Sep 24, 2024
@sp98 sp98 self-assigned this Sep 25, 2024
sp98 (Contributor) commented Sep 25, 2024

@denppa Rook deletes the pools when the objectstore CR is deleted. Can you please try to reproduce this again and share the Rook operator logs?

BlaineEXE (Member) commented:

If the finalizer is removed from the CephObjectStore resource while it is deleting, Rook will not be able to gracefully clean up the pools, because they could still contain user data. This can then prevent the CephObjectStore from being recreated. Likely, this is what occurred. If true, this is a matter of Rook working as intended to ensure user data safety.
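
One way to check whether this is what happened is to look for the previous store's pools from the rook-ceph-tools pod; a sketch, assuming the usual rgw pool naming:

# Pools left behind by the bypassed finalizer will still be listed here
ceph osd pool ls detail | grep rgw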


denppa commented Sep 26, 2024

> If the finalizer is removed from the CephObjectStore resource while it is deleting, Rook will not be able to gracefully clean up the pools, because they could still contain user data. This can then prevent the CephObjectStore from being recreated. Likely, this is what occurred. If true, this is a matter of Rook working as intended to ensure user data safety.

Yes. Because the delete command was hanging, I assumed it was stuck on the finalizer, which is why I removed it. However, it makes sense that Rook disallows overwriting the pools to protect their data.

For future reference, wait for the delete command to complete gracefully to avoid this issue.
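
A sketch of the safe flow, relying on the fact that kubectl delete blocks until all finalizers have been processed:

# Delete and let Rook run its cleanup; do not touch the finalizer
kubectl -n rook-ceph delete cephobjectstore nextcloud-obj-store --wait=true
# If deletion was started elsewhere, wait for the resource to disappear
kubectl -n rook-ceph wait --for=delete cephobjectstore/nextcloud-obj-store --timeout=10m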

@denppa denppa closed this as completed Sep 26, 2024